Protein Engineering, Design and Selection

Oxford University Press (OUP)

Preprints posted in the last 90 days, ranked by how well they match Protein Engineering, Design and Selection's content profile, based on 14 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit.

1
Enhancing ML-based binder design with high-throughput screening: a comparison of mRNA and yeast display technologies

Yao, Z.; Metts, J. M.; Huber, A. K.; Li, J.; Kinjo, T.; Dieckhaus, H.; Nallathambi, A.; Bowers, A.; Kuhlman, B.

2026-02-14 bioengineering 10.64898/2026.02.12.705611 medRxiv
Top 0.1%
8.4%

Recent advances in machine learning (ML)-based protein design methods have enabled the rapid in silico generation of large libraries of miniprotein binders with minimal manual input. While computational design capacity has scaled rapidly, experimental validation methods have lagged, creating a bottleneck in binder discovery pipelines. Here, we apply mRNA display to screen an ML-designed miniprotein binder library and directly compare its performance with the more widely used yeast surface display platform using a single shared DNA library. We screened 2,009 designs targeting the platelet receptor TLT-1 and 3,159 designs targeting the immune receptor B7-H3 across both platforms. While both selection methods reliably identified functional binders, we found that mRNA display preferentially enriched binders with slower dissociation rates. In addition, mRNA display achieved higher library coverage than yeast display, likely rescuing functional designs that are penalized in a cell-based expression system. Biophysical characterization of selected binders from both platforms revealed strong binding affinities and high thermal stabilities. These results showcase the power of integrating ML-based computational design tools with rapid in vitro selection technologies, providing a scalable framework for therapeutic miniprotein discovery. IMPORTANCE: Miniprotein binders offer major advantages as next-generation therapeutics, including small size, high stability, and efficient production. In this work, we conduct a side-by-side comparison of mRNA and yeast display as platforms for high-throughput evaluation of de novo miniprotein binders. The binders generated here serve as starting points for therapeutics targeting TLT-1 or B7-H3, two clinically relevant molecules.

2
FLIP2: Expanding Protein Fitness Landscape Benchmarks for Real-World Machine Learning Applications

Didi, K.; Alamdari, S.; Lu, A. X.; Wittmann, B.; Johnston, K. E.; Amini, A. P.; Madani, A. K.; Czeneszew, M.; Dallago, C.; Yang, K. K.

2026-02-24 bioengineering 10.64898/2026.02.23.707496 medRxiv
Top 0.1%
4.9%

Machine learning methods that predict protein fitness from sequence remain sensitive to changes in data distributions, limiting generalization across common conditions encountered in protein engineering. Practically, protein engineers are thus left wondering about the effective utility of ML tools. The FLIP benchmark established protocols for testing generalization under some domain shifts, but it was limited to measurements of thermostability, binding, and viral capsid viability. We introduce FLIP2, a protein fitness benchmark spanning seven new datasets, including enzymes, protein-protein interactions, and light-sensitive proteins, as well as splits that measure generalization relevant to real-world protein engineering campaigns. Evaluating a suite of benchmark models across these datasets and suites reveals that simpler models often matched or outperformed fine-tuned protein language models on FLIP2, challenging the utility of existing transfer learning techniques. Provenance for all datasets has been recorded, and we redistribute all data under CC-BY 4.0 to facilitate continued progress.

3
Design to Data for mutants of β-glucosidase B from Paenibacillus polymyxa: Y333F, A88E, L219Q, A408H, Y173L, E340S, and Y422F

Maduros, A.; Farinsky, L.; Tagkopoulos, P.; Vater, A.; Siegel, J. B.

2026-02-05 biochemistry 10.64898/2026.02.04.703908 medRxiv
Top 0.1%
4.2%

This study explores computational design predictions related to experimental enzyme behavior by analyzing seven single-point mutants of β-glucosidase B (BglB) from Paenibacillus polymyxa: Y333F, A88E, L219Q, A408H, Y173L, E340S, and Y422F. Each mutation was modeled using Foldit Standalone, and mutant selections were based on predicted thermodynamic stability changes of interest. Six of the seven mutants in this set yielded soluble, expressed protein. Most variants had similar catalytic efficiency compared to the wild type with one exception. The melting temperatures for most variants were also similar to the wild type. Correlation analysis revealed weak but potentially informative relationships between predicted ΔTSE and (a) thermal stability and (b) catalytic efficiency. These results further support known limitations of TSE score as a tool for single point mutation design and add to a growing dataset being generated to build the next generation of functionally predictive protein models.

4
Enzyme Classification via Semi-Supervised Functional Residue Learning

Gong, C.; Zhang, D.; Ouyang-Zhang, J.; Liu, Q.; Klivans, A.; Diaz, D.

2026-02-14 bioengineering 10.64898/2026.02.11.705200 medRxiv
Top 0.1%
4.0%

Predicting enzymatic function from a protein sequence is a fundamental task in protein discovery and engineering. In this paper, we present Semi-supervised Learning for Enzyme Classification (SLEEC): a semi-supervised learning framework that learns a function-aware protein representation for Enzyme Commission (EC) number prediction. SLEEC achieves SOTA performance on standard benchmarks and provides interpretable, residue-level annotations. We further demonstrate that our framework is robust to benign sequence modifications routinely observed in protein engineering workflows, such as appending functional tags, a desirable property that current ML frameworks lack. Our main technical contribution is a multiple sequence alignment (MSA)-based data augmentation technique for discovering sparse residue activations within a given enzyme sequence.

5
CombinGym: a benchmark platform for machine learning-assisted design of combinatorial protein variants

Chen, Y.; Fu, L.; Lu, X.; Li, W.; Gao, Y.; Wang, Y.; Ruan, Z.; Si, T.

2026-03-25 synthetic biology 10.64898/2026.03.24.714074 medRxiv
Top 0.1%
3.8%

Combinatorial mutagenesis is essential for exploring protein sequence-function landscapes in engineering applications. However, while large-scale machine learning benchmarks exist for protein function prediction, they are primarily limited to single-mutant libraries, leaving a critical gap for combinatorial mutagenesis. Here we introduce CombinGym, a benchmarking platform featuring 14 curated combinatorial mutagenesis datasets spanning 9 proteins with diverse functional properties including binding affinity, fluorescence, and enzymatic activities. We evaluated nine machine learning algorithms from five methodological categories (alignment-based, protein language, structure-based, sequence-label, and substitution-based) across multiple prediction tasks, assessing both zero-shot and supervised learning performance using Spearman's ρ and Normalized Discounted Cumulative Gain metrics. Our analysis reveals the substantial impact of measurement noise and data processing strategies on model performance. By implementing hierarchical dataset splits (0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest scenarios), we demonstrate the value of lower-order mutation data for empowering machine learning models to predict higher-order mutant properties. We validated this capacity through both in silico simulation (improving fluorescence brightness of an oxygen-independent fluorescent protein) and experimental validation (engineering enzyme substrate specificity), achieving a substantial increase in specific activity. All datasets, benchmarks, and metrics are available through an interactive website (https://www.combingym.org), facilitating collaborative dataset expansion and model development through integration with automated biofoundry platforms.
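
The hierarchical splits named in this abstract can be illustrated with a minimal sketch. It assumes that "k-vs-rest" means training on variants carrying at most k substitutions relative to wild type and testing on all remaining variants; the abstract does not define the splits, and the function names and toy sequences below are hypothetical, not taken from CombinGym.

```python
from typing import Dict, List, Tuple


def count_mutations(variant: str, wild_type: str) -> int:
    """Count positions where an equal-length variant differs from wild type."""
    if len(variant) != len(wild_type):
        raise ValueError("variant and wild type must have equal length")
    return sum(a != b for a, b in zip(variant, wild_type))


def k_vs_rest_split(
    fitness: Dict[str, float], wild_type: str, k: int
) -> Tuple[List[str], List[str]]:
    """Assumed k-vs-rest split: train on variants with <= k mutations, test on the rest."""
    train: List[str] = []
    test: List[str] = []
    for variant in fitness:
        if count_mutations(variant, wild_type) <= k:
            train.append(variant)
        else:
            test.append(variant)
    return train, test


# Toy data (hypothetical): wild type plus single, double, and triple mutants.
WILD_TYPE = "ACDE"
FITNESS = {"ACDE": 1.0, "ACDF": 0.8, "GCDF": 0.5, "GHDF": 0.2}

train_1, test_1 = k_vs_rest_split(FITNESS, WILD_TYPE, k=1)
```

Under this reading, raising k moves lower-order mutants into the training set, which is how such a split would probe whether lower-order data help predict higher-order variants.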

6
What comes after de novo? Automated lead optimization of proteins with CRADLE-1

Bixby, E.; Brunner, G.; Danciu, D.; Dela Rosa, R.; Deutschmann, N.; Ferragu, C.; Geiger, F.; Holberg, C.; Kidger, P.; Lindoulsi, A.; Lutz, N.; McColgan, T.; Milius, S.; Shah, J.; Vandeloo, M.; Vidas, P.; Ziegler, J. D.; van Rossum, H.; van der Vorm, D.; Baldi, N.; IJSpeert, C.; Monza, E.; Schriek, A.

2026-03-08 bioengineering 10.64898/2026.03.06.710001 medRxiv
Top 0.1%
3.1%

Lead optimization remains the longest and most expensive step in pre-clinical drug discovery, typically consuming 12-36 months whilst costing $5M-$15M per candidate. We introduce CRADLE-1, an automated framework for protein engineering. While CRADLE-1 supports the full process of drug discovery and industrial protein engineering pipelines, including hit identification and de novo binder design, this work focuses on its application to multi-property lead optimization across protein modalities (VHHs, scFvs, IgGs, peptides, enzymes, CRISPR systems, vaccines). We show it is 4-7x faster than rational design, as measured by the number of wet lab rounds required. We provide in vitro validation across all of the above modalities, typically optimizing multiple properties simultaneously (single and polyspecific binding down to picomolar, activity, thermostability, ...). Technically, CRADLE-1 starts with pre-trained foundation protein language models (PLMs), which are fine-tuned in unsupervised fashion on evolutionary neighborhoods, in supervised fashion using lab-in-the-loop data, and then deployed in a multi-model workflow. Of additional interest, we find that (a) the end-to-end system may be run in automated fashion; (b) wet lab data may be consumed in black box fashion without knowledge of the underlying biochemical mechanisms; (c) structural data may largely be superseded by sequence-function pairs.

7
Scalable prediction of symmetric protein complex structures

Yu, V. S.; Demsko, P.; Castells-Graells, R.; Parker, H.; Huang, A.; Chen, C.; Huang, M.; Srinivasan, V.; Ajjarapu, K.; Tofighbakhsh, N.; Yu, R.; Lake, M.; Glanzman, D.; Warren, S.; Alzagatiti, J.

2026-02-05 bioengineering 10.1101/2025.11.14.688531 medRxiv
Top 0.1%
2.7%

All life relies on proteins to function, yet accurately modeling protein structures that exceed ≈10,000 amino acids or have higher-order geometries remains difficult. Existing solutions are limited to specific scenarios, require considerable computational resources, or are otherwise unscalable. Consequently, many large, disease-relevant protein complexes in the human proteome, as well as nearly all viruses and numerous other classes, are impractical to model with high fidelity for drug development. To modulate these protein complexes and viruses, structural information is eminently valuable, and often essential. In the last two years, machine learning-based tools that can generate binders to a given target structure with high hit rates have emerged. Combined with high-throughput screening, these technologies can far outpace traditional drug discovery. However, they cannot function well without accurate models of their target structures. Thus, to unlock the full power of AI-driven drug discovery, a scalable method must be developed to predict large protein complex structures. To overcome this bottleneck, we introduce Plica-1, a physics-based method to rapidly and accurately predict the structure of arbitrarily large, symmetric protein complexes. Validated across 4 major symmetry classes (icosahedral, tetrahedral, octahedral, and cyclic), the method consistently achieves near-experimental levels of accuracy, i.e., RMSD < 5 Å. In test cases, the method runs in < 5 minutes on consumer hardware, 10³ to 10⁵ times faster than the closest comparable software. The largest structure currently built, at ≈40,000 amino acids, is > 8 times the limit of existing machine learning methods. The results demonstrate that protein complexes can be modeled at significantly improved speeds and scales, making Plica-1 a promising tool for protein engineering and drug development.

8
Benchmarking and Experimental Validation of Machine Learning Strategies for Enzyme Engineering

Zeng, Z.; Jin, J.; Xu, R.; Luo, X.

2026-03-30 bioengineering 10.64898/2026.03.29.715152 medRxiv
Top 0.1%
2.6%

Enzyme-directed evolution increasingly relies on computational tools to prioritize mutations, yet their practical value is difficult to assess because kinetic data are often aggregated across heterogeneous assay conditions, inflating apparent generalization. Here we introduce EnzyArena, a curated benchmark that groups kinetic parameters (kcat, Km, kcat/Km) into condition-matched experimental subsets to enable realistic evaluation. Using this resource, we benchmark 10 representative models from two emerging strategy families (zero-shot fitness prediction and supervised kinetic-parameter prediction) across BRENDA- and SABIO-RK-derived subsets and 25 independent mutagenesis datasets. Kinetic-parameter predictors perform strongly on database-derived subsets but lose their advantage on independent datasets, whereas zero-shot predictors show more consistent generalization. A simple consensus of multiple zero-shot models further improves the precision of identifying beneficial mutants. We prospectively validated these findings in a wet-lab campaign (150 mutants) comparing random mutants, UniKP-prioritized mutants and ESM-1v-prioritized mutants (representing supervised kinetic-parameter prediction and zero-shot fitness prediction, respectively), where ESM-1v achieved the highest utility and UniKP underperformed the random baseline. Together, this study establishes realistic baselines for computational mutant prioritization and highlights consensus zero-shot strategies as a practical starting point for enzyme engineering.
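
The abstract does not say how the "simple consensus of multiple zero-shot models" is computed. One common way to build such a consensus is to average each mutant's rank across models, sketched below as an illustration only; the function names, mutant labels, and scores are hypothetical and are not taken from EnzyArena.

```python
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank mutants by score (1 = highest); ties are broken by mutant name."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {mutant: i + 1 for i, mutant in enumerate(ordered)}


def consensus_ranking(model_scores: List[Dict[str, float]]) -> List[str]:
    """Order mutants by their mean rank across models (best first)."""
    if not model_scores:
        raise ValueError("at least one model is required")
    per_model = [ranks(scores) for scores in model_scores]
    mutants = set(per_model[0])
    if any(set(r) != mutants for r in per_model):
        raise ValueError("all models must score the same mutants")
    mean_rank = {m: sum(r[m] for r in per_model) / len(per_model) for m in mutants}
    return sorted(mutants, key=lambda m: (mean_rank[m], m))


# Hypothetical zero-shot scores from two models (higher = predicted fitter).
model_a = {"A10V": 0.9, "G25S": 0.1, "L40F": 0.5}
model_b = {"A10V": 0.2, "G25S": -1.0, "L40F": 0.8}

prioritized = consensus_ranking([model_a, model_b])
```

Rank averaging is used here because raw zero-shot scores from different models are generally not on a shared scale; other consensus rules (for example, voting on a top fraction) would be equally plausible readings.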

9
A Yeast Surface Display Platform for Screening Dimeric Mammalian Receptors

Slaton, E. W.; Krivanek, E. C.; Kimmel, B. R.

2026-01-30 synthetic biology 10.64898/2026.01.29.702702 medRxiv
Top 0.1%
2.1%

Discovering proteins that modulate receptor activity remains a key challenge in the field of protein design and engineering. Traditionally, identifying proteins that interact with receptors often relies on binding as a selection criterion, yielding limited information about the function of discovered binders in a library, including the ability to activate or block signaling cascades associated with the receptor of interest. As a result, extensive downstream characterization is required to assess the biological relevance of discovered binders. To address this issue, we have developed a high-throughput screening system to screen dimeric mammalian receptors using yeast surface display. We demonstrate the programmed dimerization of the extracellular domains of mammalian receptors in yeast via engineered induction pathways, thereby enabling receptor expression and the secretion of associated native cytokines. This surface expression of the involved subunits for the protein receptor and cytokine-induced dimerization activity indicates that the receptor has been activated and is expected to trigger a DNA-driven signaling cascade within a mammalian cell. This system provides a modular platform technology that advances existing yeast-display systems, demonstrating the effectiveness of these high-throughput platforms for screening the function of mammalian receptors. This work is expected to provide a rapid, cost-effective approach to the molecular discovery of novel biologics for targeting dimeric mammalian receptors.

10
High-Throughput FRET Affinity Screening Technique (HTFAST) For Cell-Free Expressed Binding Protein Characterization

Hejazi, S. S.; Noroozi, K.; Jurasic, V.; Jarboe, L. R.; Reuel, N. F.

2026-02-13 bioengineering 10.64898/2026.02.12.697512 medRxiv
Top 0.1%
1.7%

The rapid engineering of high-affinity binding proteins, such as nanobodies and single-domain antibodies (sdAbs), is increasingly driven by cell-free, machine-learning-guided optimization. However, high-throughput, quantitative characterization of binding affinity remains a major bottleneck, particularly for proteins expressed in cell-free systems without purification. Here, we present High-Throughput FRET Affinity Screening Technique (HTFAST) for rapid affinity characterization of binders expressed directly in crude E. coli cell-free protein synthesis (CFPS) reactions. HTFAST leverages Förster resonance energy transfer (FRET) between fluorescent-protein-fused binders and dye-labeled antigens to enable real-time, quantitative measurement of equilibrium dissociation constants. We systematically optimized the fluorophore pairs and labeling parameters using the SpyTag003-SpyCatcher003 model system. Using donor-quenching and acceptor-emission FRET analyses, HTFAST reliably quantified nanomolar binding affinities in crude lysates for the SpyTag003-SpyCatcher003 model system. We validated the platform for nanobodies by characterizing a CD4-binding nanobody, Nb457, and benchmarking multiple SARS-CoV-2 receptor-binding domain sdAbs, demonstrating HTFAST's ability to rank binding strengths across a range of affinities. Finally, we demonstrate that both binding partners can be expressed directly in CFPS, further streamlining screening workflows. Overall, HTFAST provides a scalable, quantitative, and cell-free-compatible approach for high-throughput affinity screening, well suited for design-build-test-learn (DBTL) campaigns aimed at accelerating the development of next-generation binding proteins.

11
Amino acid variants at the P94 position in Staphylococcus aureus class A sortase modulate substrate binding and enzyme activity

Cox-Tigre, N.; Stewart, M. E.; Tucker, J.; Walkenhauer, E. G.; Wilce, C. S.; Antos, J. M.; Amacher, J. F.

2026-01-18 biochemistry 10.64898/2026.01.18.700168 medRxiv
Top 0.1%
1.6%

The surface of gram-positive bacteria is a highly regulated environment with specific attachment of proteins required for viability. Sortase enzymes are cysteine transpeptidases that recognize and ligate substrates to the peptidoglycan layer in these microorganisms, which can be highly pathogenic (e.g., Staphylococcus aureus, Streptococcus pyogenes, etc.). As such, sortases represent a potentially novel target for antibiotic development. In addition, the catalytic activity of sortase enzymes is utilized in sortase-mediated ligation (SML) engineering approaches for a variety of uses. In SML experiments, engineered variants of Staphylococcus aureus sortase A (saSrtA) are the most widely used enzymes. One of the mutated amino acids in the previously engineered pentamutant (or saSrtA5M) enzyme is P94. Structural analyses of experimental saSrtA structures revealed that P94 interacts directly with Y187 when saSrtA is in its inactive conformation. While saSrtA5M, developed via directed evolution, contains a P94R mutation, we wanted to interrogate this position further and ask if other single P94 mutations may reveal a greater effect on activity and/or substrate specificity. We created 18 P94X mutations (excluding P94C), and tested relative activity using a fluorescence resonance energy transfer (FRET) assay for 4 substrate sequences: LPATG, LPETG, LPKTG, and LPSTG. We identified several P94 variants that outperformed the single mutant P94R for all peptides tested, including P94A, P94D, P94E, P94G, P94H, P94N, P94Q, P94S, and P94T. We further observed that the reactivity of substrates with variations in the central position of the pentapeptide recognition motif (LPXTG) can be sensitive to the identity of the P94X residue. We tested P94A and P94D saSrtA5M variants and found that, depending on LPXTG sequence, these variants could outperform saSrtA5M in activity > 3-fold. 
Finally, we compared saSrtA5M and P94D saSrtA5M in a model sortase-mediated ligation reaction using a LPKTG substrate and saw ~2-fold greater product formation. Taken together, we characterized an important position that modulates substrate access and activity in saSrtA. Furthermore, we argue that future studies which combine rational design and high throughput approaches, e.g., directed evolution, may result in sortase variants with increased SML potential.

12
Library docking for Cannabinoid-2 Receptor ligands

Rachman, M. M.; Iliopoulos-Tsoutsouvas, C.; Dominic Sacco, M.; Xu, X.; Wu, C.-G.; Santos, E.; Glenn, I. S.; Paris, L.; Cahill, M. K.; Ganapathy, S.; Tummino, T. A.; Moroz, Y. S.; Radchenko, D. S.; Okorie, M.; Tawfik, V. L.; Irwin, J. J.; Makriyannis, A.; Skiniotis, G.; Shoichet, B. K.

2026-03-21 biochemistry 10.64898/2026.03.19.713017 medRxiv
Top 0.1%
1.5%

Cannabinoid receptors are therapeutically promising GPCRs that are also interesting test systems for structure-based methods, which have targeted them previously. Here we used the CB2 receptor as a template to explore several topical questions in library docking. Whereas an earlier campaign against the CB1 receptor led to potent but relatively non-selective ligands, here we found that targeting interactions with polar, orthosteric site residues led to subtype-selective ligands. Docking hit rate and especially hit affinity improved in moving from a 7 million to a 2.6 billion molecule library. Similar to earlier studies, docking against active and inactive states of the receptor did not reliably bias toward the discovery of agonists or inverse agonists. Cryo-EM structures of two of the new agonists, each in a different chemotype, superposed well on the docking predictions. Correspondingly, structure-based optimization led to 10- to 140-fold improvements within three different series, also consistent with well-behaved ligand families. Hit rates with a fully enumerated 2.6 billion molecule library resembled those of an implied 11 billion molecule library from a building-block method, consistent with the latter's ability to explore this space, though higher affinities were discovered from the fully enumerated set. Overall, eight diverse families of ligands, with potencies <100 nM and mostly unrelated to previously known ligands, were found. Implications for future studies are considered.

13
Surface Display For Phage Assisted Continuous Evolution: A Platform For Evolving / Screening Nanobodies In Prokaryote Systems

Flores-Mora, F. E.; Brodsky, J.; Cerna, G. M.; Tse, A.; Hoover, R. L.; Bartelle, B. B.

2026-04-04 synthetic biology 10.64898/2026.04.03.716437 medRxiv
Top 0.1%
1.5%

Despite >50 years of methods development, specific antibodies are still generated at low throughput and remain in high demand across biotechnology. Most biologics and immunoprobes are monoclonal antibodies, developed using a combination of inoculating animals with a target antigen, engineered candidate libraries, and multiple rounds of selection using phage or yeast display. Here we introduce a synthetic biology scheme to eliminate the need for nearly all of these steps, by combining Surface display on E. coli and Phage display with the microvirus ΦX174, Assisting Continuous Evolution (SurPhACE). Instead of building libraries for screening, SurPhACE runs a closed evolutionary program. A typical experiment can have 10¹¹ mutant candidates under active selection, with complete turnover of the mutant population every 30 min, or >5×10¹² unique mutants per day, using less than 100 mL of bacterial culture media. We demonstrate SurPhACE for optimizing a nanobody to a related epitope, and develop novel nanobodies for an arbitrary target using a minimal starting library to establish a proof of concept and identify best practices for this scalable method for generating protein binders.

14
Data-efficient distal engineering of fluorinase using zero-shot models

Harding-Larsen, D.; Lax, B. M.; Weingarten, C. K.; Sako, A.; Mazurenko, S.; Welner, D. H.

2026-02-12 bioengineering 10.64898/2026.02.11.705267 medRxiv
Top 0.1%
1.3%

Fluorinases have high potential for industrial biofluorination, but applications have been precluded by low catalytic efficiency and resistance to active site engineering. In this work, we employed PRIZM, a computational workflow utilizing an existing low-N dataset and zero-shot models for in silico prediction of activity-enhancing mutations at distal sites. The combination of these predictions with expert opinion led to the identification of 21 fluorinase mutants with enhanced relative activities, while 3 variants showed increased melting temperatures. A mutation in the hexameric interface, K237R, resulted in the largest stability gain, a more than 3.2-fold improvement in catalytic efficiency at 57 °C, and an 8-fold increase in relative activity at 62 °C. These results highlight the potential of distal fluorinase engineering for improving properties required to realize its industrial applications.

15
An Energy Landscape Approach to Miniaturizing Enzymes using Protein Language Model Embeddings

Lala, J.; Agrawal, H.; Dong, F.; Wells, J.; Angioletti-Uberti, S.

2026-03-05 bioinformatics 10.64898/2026.03.04.709378 medRxiv
Top 0.1%
1.3%

We present a general approach to find amino acid sequences corresponding to the most compact enzyme likely to retain the structure of a given catalytic site. Our approach is based on using Monte Carlo (MC) simulations to sample an energy landscape where minima correspond, by construction, to sequences with the aforementioned properties. Building on previous work (Wu et al., 2025) and with the BAGEL package (Lala et al., 2025), we implement a route to achieve this goal using only the information extracted from a protein language model (PLM), without structural information. After generating a set of candidate sequences with this PLM-guided BAGEL optimization, we further filter potential candidates for downstream experimental validation using a two-stage protocol. First, deep-learning-based structure prediction models (ESMFold, Chai-1, Boltz-2) are used to identify a structural consensus among designs with highly conserved active-site geometries, yielding many candidates with active-site RMSD below a few angstroms relative to the wild-type and pLDDT scores above 80. Second, molecular dynamics simulations are performed on a filtered subset of sequences (based on active-site RMSD and SolubleMPNN log-likelihoods) to evaluate active-site stability when including thermal fluctuations. For the most promising enzymes, these yield RMSF values in the active site below 1.0 Å and an active-site RMSD drift between 0.5 and 1.5 Å, making these mini-variants comparable to the wild type, though outcomes vary across enzymes. Given the protocol's generality, we believe these results represent a step forward in AI-guided enzyme design. To facilitate rapid experimental validation by the broader community, we open-source all sequences generated by our computational pipeline. These include designs for four representative enzymes of this study: PETase, subtilisin Carlsberg (serine protease), Taq DNA polymerase, and VioA.

16
CARTiBASE: an interactive knowledge base for CAR sequence retrieval and similarity analysis

Le Compte, G.; Ceylan, H.; Meysman, P.; Laukens, K.

2026-02-26 immunology 10.64898/2026.02.25.707638 medRxiv
Top 0.1%
1.3%

Summary: Chimeric Antigen Receptors (CARs) are modular synthetic constructs that have transformed cellular immunotherapy, enabling targeted recognition and killing of malignant cells. Their clinical success has driven an explosive growth in new receptor designs, but these sequences are dispersed across heterogeneous sources such as publications, patents and supplementary files. This fragmentation and inconsistency limit comparative analysis, reproducibility and the reuse of existing constructs. To address this, we curated and standardized more than 10,000 CAR sequences into a single, harmonized resource. CARTiBASE is a web-based platform that provides standardized annotation, interactive browsing and fast similarity search across this curated collection. This unique database was leveraged to analyse the diversity in current CAR constructs within the public domain, revealing common design trends and lineages, as well as highlighting potential avenues for future CAR development. Availability and Implementation: CARTiBASE is freely available for non-commercial use at https://www.cartibase.org, without mandatory registration. The web server is implemented with a Python/Flask API backend and a Vue-based frontend and supports all major browsers. Users can search and filter thousands of CARs, inspect domain boundaries across signal peptide, antigen-binding domain, hinge, transmembrane, co-stimulatory and intracellular signaling regions, compare constructs and download sequences as FASTA files for downstream use.

17
GROQ-seq Enables Cross-site Reproducibility for High-Throughput Measurement of Protein Function

Spinner, A.; Ross, D.; Cortade, D.; Ikonomova, S.; Baranowski, C.; Dhroso, A.; Reider Apel, A.; Sheldon, K.; Duquette, C.; Kelly, P. J.; DeBenedictis, E.; Hudson, C.

2026-04-09 bioengineering 10.64898/2026.04.07.716961 medRxiv
Top 0.1%
1.0%

High-throughput functional assays are increasingly used to generate large-scale protein function datasets for protein engineering and machine learning applications. However, the utility of such datasets depends on the reproducibility of the underlying measurements. Here we report reproducible, quantitative measurements of protein sequence-to-function data at scale across two facilities. We analyze GROQ-seq (Growth-based Quantitative Sequencing) measurements of three bacterial transcription factors. Independent barcode measurements of the same sequence produce highly consistent functional estimates, demonstrating strong biological reproducibility (across all transcription factors the mean Root Mean Square Deviation [RMSD] ≈ 0.53 and mean Spearman ≈ 0.63). We also compared experiments performed at two facilities using a shared protocol, but with differing levels of automation and system integration. We observe strong agreement between measurements taken at the two sites (mean RMSD ≈ 0.41 and mean Spearman ≈ 0.730). Orthogonal tests further support this agreement: a classifier trained to distinguish data by site performs near random (AUC = 0.559), and top-ranking variants show strong statistical overlap between experiments. Together, these results demonstrate that GROQ-seq enables reproducible, scalable measurement of protein function suitable for large aggregated datasets.

18
Ligify 2.0: A web server for predicted small molecule biosensors

d'Oelsnitz, S.; Zhao, N. N.; Talla, P.; Jeong, J.; Love, J. D.; Springer, M.; Silver, P. A.

2026-02-08 bioengineering 10.1101/2025.10.20.683484 medRxiv
Top 0.1%
0.9%

Prokaryotic transcription factors (TFs) are used as small molecule biosensors with broad applications in biotechnology, yet only a small fraction from microbial genomes have been characterized. To address this gap, we recently described the bioinformatic method Ligify, which leverages information from genome context and enzyme reaction databases to predict a TF's cognate effector molecule. Here we report Ligify 2.0, a modern web server for Ligify predictions. We systematically evaluate 10,965 small molecules within the Rhea enzyme reaction database for associations to TFs, ultimately generating 13,435 hypothetical interactions between 1,362 small molecules and 3,164 TFs. We then develop an interactive web server (https://ligify.groov.bio) to search and visualize prediction data. Each TF sensor page includes visualizations for chemical ligand structures, interactive TF protein structures, and genome context. Pages also include metadata links, predicted promoter sequences, prediction confidence metrics, and references to relevant literature. A plasmid builder tool enables users to generate custom biosensor circuit designs. Finally, we provide case studies using Ligify 2.0 to identify two TFs from the pathogens Escherichia coli O157:H7 and Mycobacterium abscessus responsive to 4-hydroxybenzoate and Pseudomonas Quinolone Signal, respectively. The Ligify web server aims to facilitate the systematic characterization of biosensors for chemical control of biological systems.
Key points: Ligify 2.0 contains >13,000 predicted transcription factor-small molecule interactions; a rich web interface provides interactive visualizations and a plasmid design tool; predicted ligands for regulators from pathogenic bacteria are experimentally validated. Graphic abstract: Figure 1.

19
Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data

Vicente, A.; Dornfeld, L.; Coines, J.; Ferruz, N.

2026-02-09 bioinformatics 10.64898/2026.02.06.704305 medRxiv
Top 0.1%
0.8%

Proteins can bind small molecules with high specificity. However, designing proteins that bind user-defined ligands remains a challenge, typically relying on structural information and costly experimental iteration. While protein language models (pLMs) have shown promise for unconditional generation and conditioning on coarse functional labels, instance-level conditioning on a specific ligand has not been evaluated using purely textual inputs. Here we frame small-molecule protein binder design as a sequence-to-sequence translation problem and train ligand-conditioned pLMs that map molecular strings to candidate binder sequences. We curate large-scale ligand-protein datasets (>17M ligand-protein pairs) covering different data regimes and train a suite of models, spanning 16 to 700M parameters. Results reveal a consistent trade-off driven by supervision ambiguity: when each ligand is paired with few proteins, models generate near-neighbour, foldable sequences; when each ligand is paired with many proteins, generations are more diverse but less consistently foldable. Our study exposes how annotation diversity and sampling choices elicit this behaviour and how it changes with the data distribution. These insights highlight dataset redundancy and incompleteness as key bottlenecks for sequence-only binder design. We release the curated datasets, trained models, and evaluation tools to support future work on ligand-conditioned protein generation.

20
Teaching Diffusion Models Physics: Reinforcement Learning for Physically Valid Diffusion-Based Docking

Broster, J. H.; Popovic, B.; Kondinskaia, D.; Deane, C. M.; Imrie, F.

2026-03-27 bioinformatics 10.64898/2026.03.25.714128 medRxiv
Top 0.1%
0.8%

Molecular docking aims to predict the binding conformation of a small molecule to its protein target. Recent work has proposed diffusion models for this task, from rigid-body docking that diffuses over ligand degrees of freedom to co-folding approaches that jointly generate protein structure and ligand pose. However, diffusion-based docking models have been shown to frequently produce physically implausible poses and fail to consistently recover key protein-ligand interactions. To address this, we introduce a reinforcement learning framework for training diffusion-based docking models directly on non-differentiable objectives. Fine-tuning DiffDock-Pocket for physical validity with our approach substantially increases the number of generated poses that are physically valid and interaction-preserving, with no increase in inference-time compute. Importantly, this comes without sacrificing structural accuracy; in fact, our approach increases the proportion of structures with near-native poses. These effects are most pronounced for protein targets that are dissimilar to the training data. Our fine-tuned DiffDock-Pocket model outperforms both classical docking algorithms and machine learning-based approaches on the PoseBusters set. Our results demonstrate that reinforcement learning can teach diffusion-based docking models to better respect physical constraints and recover key interactions, without relying on inference-time corrections.